Recommended from our members
Gender and vocal production mode discrimination using the high frequencies for speech and singing
Humans routinely produce acoustical energy at frequencies above 6 kHz during vocalization, but this frequency range is often not represented in communication devices and speech perception research. Recent advancements toward high-definition (HD) voice and extended-bandwidth hearing aids have increased interest in the high frequencies. The potential perceptual information provided by high-frequency energy (HFE) is not well characterized. We found that humans can accomplish tasks of gender discrimination and vocal production mode discrimination (speech vs. singing) when presented with acoustic stimuli containing only HFE at both amplified and normal levels. Performance in these tasks was robust in the presence of low-frequency masking noise. No substantial learning effect was observed. Listeners also were able to identify the sung and spoken text (excerpts from “The Star-Spangled Banner”) with very few exposures. These results add to the increasing evidence that the high frequencies provide at least redundant information about the vocal signal, suggesting that representing them in communication devices (e.g., cell phones, hearing aids, and cochlear implants) and speech/voice synthesizers could improve these devices and benefit both normal-hearing and hearing-impaired listeners.
Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals
In this paper, we propose a new method for accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy, wherein an initial set of formant candidates is estimated using short-time analysis (e.g., 10–50 ms), followed by a tracking stage based on dynamic programming or a linear state-space model. One of the main disadvantages of these approaches is that the tracking stage, however good it may be, cannot improve upon the formant estimation accuracy of the first stage. The proposed TVQCP method provides single-stage formant tracking that combines the estimation and tracking stages into one. TVQCP analysis combines three approaches to improve formant estimation and tracking: (1) it uses temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) it increases the sparsity of the prediction residual through the choice of optimization criterion, and (3) it uses time-varying linear prediction analysis over long time windows (e.g., 100–200 ms) to impose a continuity constraint on the vocal tract model and hence on the formant trajectories. Formant tracking experiments with a wide variety of synthetic and natural speech signals show that the proposed TVQCP method performs better than conventional and popular formant tracking tools, such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on deep neural networks trained in a supervised manner). Matlab scripts for the proposed method can be found at: https://github.com/njaygowda/ftrac
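For contrast with the single-stage TVQCP approach, the conventional baseline the abstract describes can be illustrated in miniature. The sketch below is not the paper's method; it is a minimal short-time LPC formant estimate (autocorrelation method with the Levinson-Durbin recursion) applied to a synthetic signal with one known resonance, recovering the resonance frequency from the angle of the estimated pole. All signal parameters are illustrative assumptions.

```python
# Minimal sketch (illustrative, NOT the TVQCP method): conventional LPC
# formant estimation on a synthetic one-resonance signal.
import cmath
import math

fs = 8000.0        # sampling rate (Hz), assumed
f_true = 1000.0    # true resonance ("formant") frequency (Hz), assumed
r = 0.97           # pole radius, controls resonance bandwidth
theta = 2 * math.pi * f_true / fs

# Impulse response of a single two-pole resonator acting as a toy vocal tract:
# 1 + a1*z^-1 + a2*z^-2 in the denominator.
a1, a2 = -2 * r * math.cos(theta), r * r
x = [0.0] * 400
x[0] = 1.0
for n in range(1, len(x)):
    x[n] = -a1 * x[n - 1] - (a2 * x[n - 2] if n >= 2 else 0.0)

def autocorr(sig, lag):
    return sum(sig[n] * sig[n - lag] for n in range(lag, len(sig)))

def levinson(sig, order):
    """Levinson-Durbin recursion: LPC coefficients of 1 + a[1]z^-1 + ..."""
    R = [autocorr(sig, k) for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = R[0]
    for i in range(1, order + 1):
        k = -(R[i] + sum(a[j] * R[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        a = new_a
        err *= (1 - k * k)
    return a

a = levinson(x, 2)
# Roots of z^2 + a[1]*z + a[2]; the upper-half-plane pole angle gives the formant.
disc = cmath.sqrt(a[1] ** 2 - 4 * a[2])
pole = max(((-a[1] + s * disc) / 2 for s in (1, -1)), key=lambda z: z.imag)
f_est = cmath.phase(pole) * fs / (2 * math.pi)
print(round(f_est))  # close to 1000 Hz
```

A two-stage tracker would repeat this estimate on successive short frames and then smooth the candidates afterward; TVQCP instead builds the temporal continuity into the prediction analysis itself.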
The perceptual significance of high-frequency energy in the human voice
While human vocalizations generate acoustical energy at frequencies up to (and beyond) 20 kHz, the energy at frequencies above about 5 kHz has traditionally been neglected in speech perception research. The intent of this paper is to review (1) the historical reasons for this research trend and (2) the work that continues to elucidate the perceptual significance of high-frequency energy (HFE) in speech and singing. The historical and physical factors reveal that, while HFE was believed to be unnecessary and/or impractical for applications of interest, it was never shown to be perceptually insignificant. Rather, the main reasons for the focus on low-frequency energy appear to be that the low-frequency portion of the speech spectrum was seen as sufficient (from a perceptual standpoint), or that the difficulty of HFE research was too great to be justifiable (from a technological standpoint). The advancement of technology continues to overcome concerns stemming from the latter reason. Likewise, advances in our understanding of the perceptual effects of HFE now cast doubt on the former. Emerging evidence indicates that HFE plays a more significant role than previously believed, and should thus be considered in speech and voice perception research, especially in research involving children and the hearing impaired.
High-resolution three-dimensional hybrid MRI + low-dose CT vocal tract modeling: A cadaveric pilot study
Summary
Objectives: MRI-based vocal tract models have many applications in voice research and education. These models do not adequately capture bony structures (e.g., teeth, mandible), and spatial resolution is often relatively low in order to minimize scanning time. Most MRI sequences achieve 3D vocal tract coverage at gross resolutions of 2 mm³ within a scan time of <20 seconds. Computed tomography (CT) is well suited for vocal tract imaging but is infrequently used due to the risk of ionizing radiation. In this cadaveric study, a single, extremely low-dose CT scan of the bony structures is blended with accelerated high-resolution (1 mm³) MRI scans of the soft tissues, creating a high-resolution hybrid CT-MRI vocal tract model.
Methods: Minimum CT dosages were determined, and a custom 16-channel airway receiver coil for accelerated high-resolution (1 mm³) MRI was evaluated. A rigid-body, landmark-based partial volume registration scheme was then applied to the images, creating a hybrid CT-MRI model that was segmented in Slicer.
Results: Ultra-low-dose CT produced images of sufficient quality to clearly visualize the bone, and exposed the cadaver to 0.06 mSv, comparable to atmospheric exposure during a round-trip transatlantic flight. The custom 16-channel vocal tract coil produced acceptable image quality at 1 mm³ resolution when reconstructed from ~6-fold undersampled data. High-resolution (1 mm³) MR imaging of short (<10 seconds) sustained sounds was achieved. The feasibility of hybrid CT-MRI vocal tract modeling was successfully demonstrated using the rigid-body, landmark-based partial volume registration scheme. Segmentations of CT and hybrid CT-MRI images provided more detailed 3D representations of the vocal tract than 2 mm³ MRI-based segmentations.
Conclusions: The method described in this study indicates that high-resolution CT and MR image sets can be combined so that structures such as teeth and bone are accurately represented in vocal tract reconstructions. Such scans will aid learning and deepen understanding of anatomical features that relate to voice production, as well as furthering knowledge of the static and dynamic functioning of individual structures relating to voice production.
Arizona Child Acoustic Database: Participant Table
The Arizona Child Acoustic Database consists of longitudinal audio recordings from a group of children over a critical period of growth and development (ages 2-7 years). The goals of this database are (1) to document acoustic changes in speech production that may be related to physical growth, and (2) to inform development of a model of speech production for child talkers. This work was funded by NSF BSC-1145011 awarded to Kate Bunton, Ph.D., and Brad Story, Ph.D., Principal Investigators.
This database contains longitudinal audio recordings of 55 American English-speaking children between the ages of 2 and 7 years, recorded at 3-month intervals. Since children began the study at different ages, some children have fewer recording sessions than others. The database can also be used to provide cross-sectional data for children of a specific age. Please refer to the subject data table for information on specific sessions, available at http://arizona.openrepository.com/arizona/handle/10150/316065.
All children were recorded using the same protocol; therefore, task numbers are consistent across children and sessions. A calibration tone is included as Record 1 for all sessions. The speech protocol focused on production of English monophthong and diphthong vowels in isolation, in sVd and hVd contexts, and in monosyllabic real words. In addition, the protocol includes several nonsense vowel-to-vowel transitions. Speakers were prompted either verbally by investigators or by graphical prompts. Details of the protocol with reference to task numbers can be found in the protocol spreadsheet, available at http://arizona.openrepository.com/arizona/handle/10150/316065.
Details on data recording:
All samples were recorded digitally using an AKG SE 300B microphone with a mouth-to-mic distance of approximately 10 inches. Signals were captured with a Marantz PMD671 as 16-bit PCM (uncompressed) at 44.1 kHz. Recordings are made available in .wav format. Individual zip files contain all recordings from a single session. Funding: NSF BSC-1145011.
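The stated format (mono, 16-bit PCM, 44.1 kHz, .wav) can be verified with Python's standard-library `wave` module before analysis. The sketch below writes a short synthetic tone so it is self-contained; the filename `session_record1.wav` is hypothetical and stands in for any downloaded session file.

```python
# Sketch: verify a recording matches the database's stated format
# (mono, 16-bit PCM, 44.1 kHz). Filename below is hypothetical.
import math
import struct
import wave

# Write a short 1 kHz tone so the example runs without the real database.
with wave.open("session_record1.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    w.setframerate(44100)
    frames = b"".join(
        struct.pack("<h", int(20000 * math.sin(2 * math.pi * 1000 * n / 44100)))
        for n in range(4410)  # 0.1 s of audio
    )
    w.writeframes(frames)

# Open as a downloaded session file would be opened, and check the header.
with wave.open("session_record1.wav", "rb") as w:
    rate = w.getframerate()
    width = w.getsampwidth()
    channels = w.getnchannels()
    print(rate, width * 8, channels)  # expect 44100 16 1
```

Checking the header first avoids silently analyzing a resampled or transcoded copy of a session recording.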
An approach to explaining formants (Story, 2024)
Purpose: This tutorial describes a possible approach to teaching the concept of formants to students in a speech science course, at either the undergraduate or graduate level. The approach is to explain formants as prominent regions of energy in the output spectrum envelope radiated at the lips, and to show how they arise as the superposition of vocal tract resonances on a source signal. Standing waves associated with vocal tract resonances are briefly explained, and standing wave animations are provided. Animations of the temporal variation of the vocal tract, vocal tract resonances, spectra, and spectrograms, along with audio samples, are included to provide dynamic demonstrations of the concept of formants.
Conclusions: The explanations, accompanying demonstrations, and suggested activities are intended to provide a launching point for understanding formants and how they can be measured, analyzed, and interpreted. As a result, participants should be able to describe the meaning of the term “formant” as it relates to a spectrum and a spectrogram, explain the difference between formants and vocal tract resonances, explain how vocal tract resonances combined with the voice source generate formants, and identify formants in both narrow-band and wide-band spectrograms and track their time-varying patterns with a formant tracking algorithm.
Supplemental Material S1. Standing wave in neutral vocal tract configuration for the first resonance.
Supplemental Material S2. Standing wave in neutral vocal tract configuration for the second resonance.
Supplemental Material S3. Standing wave in neutral vocal tract configuration for the third resonance.
Supplemental Material S4. Pressure distribution in neutral vocal tract configuration at 1000 Hz, off resonance.
Supplemental Material S5. Animation of the temporal variation of the components of the source-filter representation during production of “Hello, how are you.” The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
Supplemental Material S6. Audio file containing the real-time voice source signal (glottal flow wave) generated during the TubeTalker simulation of “Hello, how are you.”
Supplemental Material S7. Audio file containing the real-time output pressure signal generated during the TubeTalker simulation of “Hello, how are you.”
Supplemental Material S8. Animation of the temporal variation of the vocal tract in two representations during production of “Hello, how are you.” In the upper inset plot the vocal tract is shown in tubular form, and in the main plot in the middle it is shown in a pseudo-midsagittal form. The lower inset plot shows the simultaneous temporal variation of the frequency response function (resonances). The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
Supplemental Material S9. Animation of the temporal variation of the frequency response function in three dimensions (time, frequency, amplitude) during production of “Hello, how are you.” There is a delay in the middle of the animation to allow the viewer to see the full history, and then the view rotates into a traditional spectrographic perspective. The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
Supplemental Material S10. Animation of the temporal variation of narrow-band spectra in three dimensions (time, frequency, amplitude) during production of “Hello, how are you.” There is a delay in the middle of the animation to allow the viewer to see the full history, and then the view rotates into a traditional spectrographic perspective. The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
Story, B. H. (2024). An approach to explaining formants. Perspectives of the ASHA Special Interest Groups. Advance online publication. https://doi.org/10.1044/2023_PERSP-23-00200
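The neutral vocal tract configuration referenced in the supplemental materials is commonly idealized as a uniform tube closed at the glottis and open at the lips, i.e., a quarter-wavelength resonator with resonances at f_n = (2n - 1)·c/(4L). The sketch below computes the first three resonances under textbook assumptions (c = 35,000 cm/s, L = 17.5 cm); it illustrates the resonance concept, not the tutorial's TubeTalker model.

```python
# Resonances of a uniform tube closed at one end (quarter-wavelength resonator):
#   f_n = (2n - 1) * c / (4 * L)
# Assumed textbook values, not taken from the tutorial itself:
c = 35000.0  # speed of sound in warm, moist air (cm/s)
L = 17.5     # vocal tract length of a typical adult male (cm)

resonances = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
print(resonances)  # [500.0, 1500.0, 2500.0]
```

These resonance frequencies become formants only after the tract's frequency response is superposed on a source spectrum, which is the distinction the tutorial asks students to articulate.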
Arizona Child Acoustic Database: Task List
The Arizona Child Acoustic Database consists of longitudinal audio recordings from a group of children over a critical period of growth and development (ages 2-7 years). The goals of this database are (1) to document acoustic changes in speech production that may be related to physical growth, and (2) to inform development of a model of speech production for child talkers. This work was funded by NSF BSC-1145011 awarded to Kate Bunton, Ph.D., and Brad Story, Ph.D., Principal Investigators.
This database contains longitudinal audio recordings of 55 American English-speaking children between the ages of 2 and 7 years, recorded at 3-month intervals. Since children began the study at different ages, some children have fewer recording sessions than others. The database can also be used to provide cross-sectional data for children of a specific age. Please refer to the subject data table for information on specific sessions, available at http://arizona.openrepository.com/arizona/handle/10150/316065.
All children were recorded using the same protocol; therefore, task numbers are consistent across children and sessions. A calibration tone is included as Record 1 for all sessions. The speech protocol focused on production of English monophthong and diphthong vowels in isolation, in sVd and hVd contexts, and in monosyllabic real words. In addition, the protocol includes several nonsense vowel-to-vowel transitions. Speakers were prompted either verbally by investigators or by graphical prompts. Details of the protocol with reference to task numbers can be found in the protocol spreadsheet, available at http://arizona.openrepository.com/arizona/handle/10150/316065.
Details on data recording:
All samples were recorded digitally using an AKG SE 300B microphone with a mouth-to-mic distance of approximately 10 inches. Signals were captured with a Marantz PMD671 as 16-bit PCM (uncompressed) at 44.1 kHz. Recordings are made available in .wav format. Individual zip files contain all recordings from a single session. Funding: NSF BSC-1145011.
A model of speech production based on the acoustic relativity of the vocal tract
A model is described in which the effects of articulatory movements to produce speech are generated by specifying relative acoustic events along a time axis. These events consist of directional changes of the vocal tract resonance frequencies that, when associated with a temporal event function, are transformed via acoustic sensitivity functions into time-varying modulations of the vocal tract shape. Because the time courses of the events may be considerably overlapped in time, coarticulatory effects are automatically generated. Production of sentence-level speech with the model is demonstrated with audio samples and vocal tract animations. © 2019 Acoustical Society of America. Published online: 17 October 2019.
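The role of overlapping event functions can be illustrated with a toy sketch. This is not the published model: it uses an assumed raised-cosine event shape and applies the shifts directly to a single resonance frequency rather than transforming them through acoustic sensitivity functions into vocal tract shape changes. It shows only the superposition idea, i.e., that temporally overlapped events blend into continuous, coarticulation-like trajectories.

```python
# Toy sketch (NOT the published model): overlapping temporal event functions,
# each pushing a resonance frequency toward a target. Where events overlap,
# their shifts superpose, blending the two gestures.
import math

def event(t, onset, offset):
    """Assumed raised-cosine event shape: 0 outside [onset, offset], peak 1 mid-span."""
    if t <= onset or t >= offset:
        return 0.0
    return 0.5 * (1 - math.cos(2 * math.pi * (t - onset) / (offset - onset)))

fR1_neutral = 500.0  # neutral first-resonance frequency (Hz), assumed
events = [
    {"onset": 0.00, "offset": 0.20, "shift": +200.0},  # event 1: raise R1
    {"onset": 0.12, "offset": 0.32, "shift": -150.0},  # event 2 overlaps event 1
]

times = [i * 0.01 for i in range(33)]  # 0 to 0.32 s in 10 ms steps
R1 = [
    fR1_neutral
    + sum(e["shift"] * event(t, e["onset"], e["offset"]) for e in events)
    for t in times
]
# Between 0.12 s and 0.20 s both events are active, so their shifts superpose
# and the trajectory moves smoothly from one gesture into the next.
```

In the actual model, the analogous shifts would be mapped through sensitivity functions into area-function modulations, so the resonance trajectories emerge from the changing tract shape rather than being specified directly.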